Data generation and processing

Sequencing protocols, best practice variant calling and filtering

Per Unneberg

NBIS

12/17/22

Data generation

Population genomics - the data

Population genomics analyzes genetic variation within and among a set of individuals, so data generation amounts to compiling variation data for each of them. The focus here is on next-generation sequencing data.

Figure 1: Sequencing cost ($) per megabase (Wetterstrand, KA)

Lou et al. (2021)

RADSeq

High coverage sequencing data

Pros: call genotypes with confidence

Cons: cost

Low coverage sequencing data

Low-coverage whole genome sequencing (lcWGS): alleviates the cost issue

Read mapping and variant calling

GATK best practice

Point out: not optimal for non-model organisms

Alternative variant callers

freebayes

bcftools

ANGSD

PoolSeq

PoPoolation

Variant filtering

General filters

Table 1: Key data filters (Table 3 in Lou et al., 2021, p. 5974)

| Category | Filter | Recommendation (examples) |
|---|---|---|
| General filters | Base quality | Recalibrate / <Q20 |
| | Mapping quality | MAQ < 20 / improper pairs |
| | Minimum depth and/or number of individuals | Varies; e.g. <50% individuals, <0.8X average depth |
| | Maximum depth | 1-2 sd above median depth |
| | Duplicate reads | Remove |
| | Indels | Realign reads / haplotype-based caller / exclude bases flanking indels |
| | Overlapping sections of paired-end reads | Soft-clip to avoid double-counting |
| Filters on polymorphic sites | \(p\)-value | \(10^{-6}\) |
| | SNPs with more than two alleles | Filter; methods often assume bi-allelic sites |
| | Minimum minor allele frequency (MAF) | 1%-10% for some analyses (PCA/admixture/LD/\(\mathsf{F_{ST}}\)) |
| Restricting analysis to a predefined site list | List of global SNPs | Use global call set for analyses requiring shared sites |
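Three of the filters in Table 1 (maximum depth, minimum number of individuals, minimum MAF) can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular tool: the function names, the argument layout, and the default thresholds (2 sd, 50% of individuals, 5% MAF) are assumptions chosen from within the ranges the table gives.

```python
from statistics import median, pstdev


def max_depth_cutoff(site_depths, n_sd=2.0):
    """Upper depth cutoff: median depth plus n_sd standard deviations.

    Assumption: the standard deviation is taken over all sites, so a few
    extreme sites inflate it; real pipelines may use a more robust spread.
    """
    return median(site_depths) + n_sd * pstdev(site_depths)


def minor_allele_frequency(alt_count, total_count):
    """Frequency of the rarer allele at a bi-allelic site."""
    freq = alt_count / total_count
    return min(freq, 1.0 - freq)


def keep_site(depth, n_with_data, n_individuals, alt_count, total_count,
              depth_cutoff, min_individual_fraction=0.5, min_maf=0.05):
    """Return True if a site passes all three (hypothetical) filters."""
    # Maximum depth: excess depth often signals collapsed repeats or paralogs.
    if depth > depth_cutoff:
        return False
    # Minimum number of individuals: require data in enough samples.
    if n_with_data / n_individuals < min_individual_fraction:
        return False
    # Minimum MAF: drop rare variants for PCA/admixture/LD/F_ST analyses.
    return minor_allele_frequency(alt_count, total_count) >= min_maf


# Toy per-site total depths; the last site is a clear outlier.
depths = [10, 12, 11, 9, 13, 10, 60]
cutoff = max_depth_cutoff(depths)
passing = [d for d in depths if d <= cutoff]
```

Which filters to apply, and how strictly, depends on the downstream analysis; a MAF filter, for example, is useful for PCA but distorts the site frequency spectrum.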

Genotype likelihoods
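To make the idea concrete, here is a minimal sketch of a genotype likelihood calculation for one diploid, bi-allelic site. It assumes a simplified textbook model in which reads are independent, each read samples either chromosome with probability 1/2, and a base call with Phred quality Q is wrong with probability 10^(-Q/10), spread evenly over the three other bases. The function name and arguments are hypothetical; callers such as ANGSD, bcftools, and GATK implement their own, more elaborate models.

```python
def genotype_likelihoods(bases, quals, ref, alt):
    """Return [P(D|ref/ref), P(D|ref/alt), P(D|alt/alt)] for one site.

    bases: observed base calls covering the site, e.g. "AAG"
    quals: Phred-scaled base qualities, one per base call
    """
    likelihoods = []
    for n_alt in (0, 1, 2):  # number of alt alleles in the genotype
        lik = 1.0
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)  # Phred quality -> error probability
            # Probability of the observed base if the read came from
            # a chromosome carrying the ref (or alt) allele.
            p_ref = 1 - err if base == ref else err / 3
            p_alt = 1 - err if base == alt else err / 3
            # The read samples an alt chromosome with probability n_alt / 2.
            lik *= (n_alt / 2) * p_alt + (1 - n_alt / 2) * p_ref
        likelihoods.append(lik)
    return likelihoods


# One read supporting the reference allele: ref/ref is favored, but
# ref/alt is only about half as likely, so a hard call would be risky.
single_read = genotype_likelihoods("A", [20], ref="A", alt="G")

# One read for each allele: the heterozygote has the highest likelihood.
two_reads = genotype_likelihoods("AG", [20, 20], ref="A", alt="G")
```

With a single read the heterozygote remains nearly as plausible as the homozygote, which is why low-coverage analyses carry the likelihoods forward instead of committing to called genotypes.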

References

Li, H. (2014). Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics, 30(20), 2843–2851. https://doi.org/10.1093/bioinformatics/btu356
Lou, R. N., Jacobs, A., Wilder, A. P., & Therkildsen, N. O. (2021). A beginner’s guide to low-coverage whole genome sequencing for population genomics. Molecular Ecology, 30(23), 5966–5993. https://doi.org/10.1111/mec.16077
Talla, V., Soler, L., Kawakami, T., Dincă, V., Vila, R., Friberg, M., Wiklund, C., & Backström, N. (2019). Dissecting the Effects of Selection and Mutation on Genetic Diversity in Three Wood White (Leptidea) Butterfly Species. Genome Biology and Evolution, 11(10), 2875–2886. https://doi.org/10.1093/gbe/evz212
Wetterstrand, KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). www.genome.gov/sequencingcostsdata